Medical transformer for multimodal survival prediction in intensive care: integration of imaging and non-imaging data

When clinicians assess the prognosis of patients in intensive care, they take imaging and non-imaging data into account. In contrast, many traditional machine learning models rely on only one of these modalities, limiting their potential in medical applications. This work proposes and evaluates a transformer-based neural network as a novel AI architecture that integrates multimodal patient data, i.e., imaging data (chest radiographs) and non-imaging data (clinical data). We evaluate the performance of our model in a retrospective study with 6,125 patients in intensive care. We show that the combined model (area under the receiver operating characteristic curve [AUROC] of 0.863) is superior to the radiographs-only model (AUROC = 0.811, p < 0.001) and the clinical data-only model (AUROC = 0.785, p < 0.001) when tasked with predicting in-hospital survival per patient. Furthermore, we demonstrate that our proposed model is robust in cases where not all (clinical) data points are available.


Results
Characteristics of the dataset. Within the MIMIC-IV dataset 21, 6,125 patients had both chest radiographs and clinical parameters, resulting in 6,798 bedside chest radiographs with corresponding clinical parameters (see Fig. 1). At the time of recording, patient age ranged from 18 to 91 years with a mean of 64 years ± 16 [standard deviation]. To preserve anonymity, all patients older than 89 years had been assigned the age of 91 years by the dataset providers. Of all patients, 55% (n = 3,382) were male and 45% (n = 2,743) were female. A total of n = 1,002 patients died in the hospital. A detailed description of the data is given in Table 1.

MeTra can be trained on unimodal data. In all metrics, training on chest radiographs only (AUROC = 0.811) tended towards better performance than training on clinical parameters only (AUROC = 0.785). Nevertheless, statistical significance was only found for specificity (p = 0.02), while the other measures did not differ significantly (AUROC, p = 0.14; sensitivity, p = 0.41; positive predictive value, p = 0.14). Exemplary images of correct and incorrect model predictions are given in Fig. 3. By trend, the combined model could correctly predict survival even when the unimodal models contradicted each other, e.g., when the radiograph was largely inconspicuous. Variable pulmonary opacifications and pleural effusions were noted in false negative and false positive predictions. Additional results can be found in Supplementary Fig. S1.
MeTra can be trained on multimodal data. When trained on both chest radiographs and clinical parameters, MeTra reached an AUROC value of 0.863 [0.835, 0.889], which was superior to both unimodal training settings (p < 0.001). Similarly, specificity (0.861 [0.841, 0.880], p < 0.001) and positive predictive value (0.486 [0.432, 0.541], p < 0.001) were significantly higher after multimodal training than after unimodal training (Fig. 2). Sensitivity was higher, too, yet not significantly so (0.732 [0.670, 0.792]; p = 0.33 vs. the radiographs-only model).

MeTra is robust to missing data. Model performance was largely upheld in cases of reduced parameter availability (Fig. 4). Intentionally, we included the clinical parameters Glasgow Coma Scale (total) and capillary refill rate even though their content was empty for all test samples. The upheld performance demonstrates robustness to input values that may be missing a priori.

Figure 1.
Visualization of the data extraction pipeline. For training, we only make use of those patients who were admitted to the intensive care unit (n = 53,150) and who had clinical data (clinical parameters, CP) with matching chest radiographs available (n = 6,125). The data are split into training (n = 4,396 patients), validation (n = 472 patients), and test (n = 1,257 patients) sets.

Discussion
In this work, we developed and evaluated the medical transformer architecture MeTra to integrate imaging and non-imaging data for survival prediction in patients in critical care. While MeTra can predict the survival of critically ill patients when trained on clinical data or imaging data exclusively, the model can combine both data sources for improved predictions. We also demonstrate that MeTra can deal with missing data and that it degrades gracefully, transitioning from high diagnostic accuracy when all data are available to reduced diagnostic accuracy when data are missing. Consequently, MeTra may be considered a blueprint for how to utilize multimodal medical data in AI models.
Other groups have worked on survival prediction without transformer architectures and only achieved comparable performance when training on considerably more data and using extensive hyperparameter tuning (Table 3). The present study is the first to investigate the performance of a fully transformer-based architecture for survival prediction of patients in intensive care and demonstrates its viability in handling imaging and non-imaging data. However, alternative transformer-based approaches have been introduced to the medical domain. Zheng et al. used the attention mechanism of transformers in combination with a graph-based method to model patient relations and utilize modality-specific data 22. Our study distinguishes itself by eliminating the need for more complex fusion mechanisms. Song et al. used transformers to combine optical coherence tomography images and visual field exams to diagnose glaucoma 23. Their data had to be presented in matrix view, which allowed the authors to tailor their architecture to the available format. The authors also resorted to a CNN for feature extraction prior to employing the transformer for modality fusion. This approach seems unsuitable for our clinical question, which aims to combine non-imaging data, such as laboratory values (typically not available in matrix view), with imaging data. Moreover, using an additional CNN does not align with our objective of implementing a purely transformer-based model. Nguyen et al. introduced the CLIMAT (Clinically-Inspired Multi-Agent Transformers) model as a fully transformer-based model for predicting the progression of knee osteoarthritis using imaging and non-imaging data 24. The authors used three distinct transformer modules to (i) extract features from imaging data, (ii) extract features from non-imaging data, and (iii) combine the extracted features to provide a set of output predictions, each corresponding to the disease severity at a specific point in time. While conceptually the authors followed a similar approach in using transformer blocks exclusively, the different clinical question necessitates architectural distinctions. In the CLIMAT model, multiple class tokens are added to the last transformer module to extract predictions for multiple time steps. Furthermore, a compressed representation of the non-imaging features is concatenated to each output token of the imaging-specific transformer module before the tokens are fed to the final transformer module. In contrast, we intentionally did not compress the non-imaging data before the multimodal fusion, to ensure that all information is visible to the model. Moreover, to make sure that each imaging token attends to all non-imaging tokens and vice versa, we feed the joint set of features as tokens through the last transformer module.
Beyond that, our work is clinically and scientifically relevant in several respects: First, our clinical experience teaches us that any predictive model used clinically must deal with missing data. Not all patients are treated and diagnosed equally, and the diagnostic toolset, from imaging to laboratory studies to clinical tests, is not consistently applied to all patients. The resultant data inconsistency and scarcity are problems for conventional machine learning models, since the number of patients with "complete" datasets for training is inherently limited. MeTra solves this problem as it can both be trained on incomplete data and deal with missing data during inference. Second, medical diagnosis is based on data from various sources: Medical doctors assess radiographs in conjunction with laboratory values, clinical tests, and history findings, among others. Developing machine learning models that rival human expertise will eventually require including data from all these sources. MeTra suggests one possible path forward by providing an architecture that can encompass data from any source. Flexible data integration is a beneficial feature of the transformer architecture that contrasts with other state-of-the-art network architectures such as CNNs, which are specifically designed to work well on images; including non-imaging data, even though possible, remains challenging 25,26.
Third, an improved survival prediction in intensive care can help assess illness severity and direct intensive care where needed to save lives and improve outcomes 3. As detailed above, MeTra achieves state-of-the-art performance on the same data splits as previous work 14, and this information is published with the MeTra model itself.

Figure 2.
To determine discrimination thresholds, the operating point was determined by maximizing Youden's criterion (sensitivity + specificity), resulting in specific values for the positive predictive value (c), sensitivity (d), and specificity (e). The combined model performed superior to the unimodal models for every metric.
Previous research has utilized ensembles of conventional machine learning algorithms 3 , CNNs in conjunction with attention mechanisms 27 , or recurrent neural networks 14 to predict patient survival. By comparison, the transformer architecture employed in MeTra has several advantages: It employs the same backbone architecture as the Vision Transformer 12 and upholds its advantages in incorporating global information at shallow layers while being more robust to adversarial attacks than CNNs 28 .
Our work has limitations: First, the survival prediction and validation data originate from a single center, owing to the unique availability of imaging and non-imaging data alongside survival data. Consequently, no external validation was performed, and the model's generalizability remains to be confirmed using multimodal datasets from other institutions and by other researchers. However, we hope our work stimulates collective efforts to assemble comparable large-scale databases. In the future, collective work on transformer models may be accelerated further by decentralized peer-to-peer collaborations, for example, using a swarm learning approach 29. Second, we only included relatively basic physiologic measures used for patient monitoring, while more complex measures of hemodynamics, oxygen metabolism, and microcirculation were not considered. Third, the number of deaths in the ICU was comparatively small, and the resultant class imbalance is an issue that needs consideration. Future work may address the class imbalance during training, for example, by including a weight factor in the loss function (accounting for the class imbalance) or by oversampling the underrepresented class 30. Additionally, a hybrid approach of transformer layers and a CNN backbone may be used to further improve performance 31.
A more comprehensive analysis of hyperparameter choices could also be performed, e.g., of the choice of vision dropout; future studies should investigate the association between specific vision dropout settings and model performance. Fourth, the clinical dataset had missing data, and any imputation may introduce bias, increase the variability of the model's performance, and affect the results. On scientific grounds, we intentionally used the same (inconsistent) impute values as other groups to compare our MeTra model to theirs. A more systematic approach would be beneficial and result in more robust models. On clinical grounds, a thorough analysis of the model's performance regarding missing and spurious data is required before deployment and use in the clinic. Specifically, excluding clinical parameter values via zero-tokens may lead to distribution shifts and impaired prediction performance. While we account for these distribution shifts through dropout layers in the MeTra architecture, future work should explore alternative methods to exclude zero-masked tokens from the input (for example, as introduced by He et al. 32). Adopting their approach would involve masking out missing clinical events at specific time points that are fed into the model individually. However, the computational burden caused by the quadratic scaling of self-attention and the associated memory requirements should be considered. Fifth, when interpreting our results in the context of the pertinent literature, it is essential to realize that the referenced results of other groups' models only indicate the range of potential outcomes. A more thorough comparison would require strict standardization of all aspects, i.e., the models would have to be trained on the same data, and the data processing pipeline would have to be identical, with a fixed random seed for augmentations. Sixth, another limitation relates to the variable time difference between imaging and non-imaging data. The non-imaging (clinical) data were collected during the first 48 h after a patient had been admitted to the ICU. In contrast, the last chest radiograph acquired during a patient's ICU stay was included as the (paired) imaging data 14. In the patient subpopulation of the MIMIC dataset that was included in our study (for whom clinical parameters and chest radiographs were available), the average ICU stay length was 5.4 ± 4.9 d (range 1.1-99.6 d, n = 6,125 patients). In our clinical experience, ICU stay lengths are affected by admission diagnosis, patient demographics, constitution, comorbidities, complications, type of treatment, and other factors, all of which add to the variability of the associated clinical parameters. Consequently, the potentially substantial time difference between the two modalities is worth considering when drawing clinical conclusions. For any meaningful clinical insights, more specific clinical questions need to be asked, more refined patient populations need to be studied, and more fine-granular analyses need to be conducted. In addition, mortality may be determined by a range of conditions with limited bearing on the chest radiograph, which is inherently limited in differentiating pathologic processes characterized by similar radiographic changes, e.g., pulmonary opacifications 33. In the clinic, the availability of clinical parameters aids in interpreting equivocal findings on chest radiographs and vice versa.
Therefore, our findings of significantly improved survival predictions based on imaging and non-imaging data become clinically plausible, yet the real clinical benefit remains to be determined.
In conclusion, we developed and validated a multimodal medical transformer model that can be trained without tweaking the architecture for specific input modalities and that exhibits robustness to missing and heterogeneous data. We achieved excellent performance in the survival prediction of patients in critical care. We also release our model as open source, providing clinicians and researchers with a benchmark model on a well-defined dataset.

Online methods
Study design. This retrospective study was approved by the local ethics committee (Reference No. 028/19) and followed local data protection regulations. All networks were trained on the publicly available datasets described below and tested for their performance in predicting the survival of patients in intensive care.

Description of dataset. The MIMIC-IV (Medical Information Mart for Intensive Care) dataset is a large US database of retrospectively collected data from two in-hospital database systems: a custom hospital-wide EHR and an ICU-specific clinical information system. The MIMIC-IV dataset contains EHR data and is linked to the MIMIC Chest X-ray (MIMIC-CXR) database, which provides the corresponding imaging data of the same patients 21,34. All data is publicly available via PhysioNet 35. For full transparency and optimal comparability, we have used the same training/test splits as other groups 14, and we publish this split alongside the model. Table 1 provides a detailed description of the dataset.

Data preprocessing. The imaging and non-imaging data were extracted from the MIMIC database and preprocessed as described by Hayat et al. 14 (Fig. 1). In detail, a subset of the MIMIC data was compiled, containing millions of clinical events corresponding to 17 clinical parameters (Table 2). Of these, the capillary refill rate and Glasgow Coma Scale (total) were missing for all patients and thus disregarded, leaving 15 clinical parameters to be included in the model. The chest radiographs (obtained as anterior-posterior projections) from the MIMIC-CXR database were extracted and matched to the EHR data. The chest radiographs were first normalized to match the dataset statistics of ImageNet 36 (in terms of means and standard deviations) and resized to a resolution of 384 × 384 pixels to allow the use of pre-trained models (see below). Data were split into training (72%), validation (8%), and test (20%) sets using patient-wise stratification but otherwise random allocation.
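To make the preprocessing concrete, the following is a minimal sketch of the two steps described above: normalizing and resizing a radiograph to ImageNet statistics at 384 × 384 pixels, and imputing a (K × T) grid of hourly clinical measurements by carrying the most recent value forward, with pre-specified defaults as a fallback. The function names and the initial min-max scaling step are our illustrative assumptions, not the authors' exact pipeline.

```python
# Illustrative preprocessing sketch (not the authors' released pipeline).
import numpy as np
import torch
import torch.nn.functional as F

IMAGENET_MEAN = (0.485, 0.456, 0.406)  # per-channel ImageNet statistics
IMAGENET_STD = (0.229, 0.224, 0.225)

def preprocess_radiograph(img: np.ndarray, size: int = 384) -> torch.Tensor:
    """Scale to [0, 1], replicate to 3 channels, resize, normalize to ImageNet stats."""
    x = torch.from_numpy(img).float()
    x = (x - x.min()) / (x.max() - x.min() + 1e-8)      # min-max to [0, 1] (assumed)
    x = x.unsqueeze(0).unsqueeze(0).repeat(1, 3, 1, 1)  # (1, 3, H, W)
    x = F.interpolate(x, size=(size, size), mode="bilinear", align_corners=False)
    mean = torch.tensor(IMAGENET_MEAN).view(1, 3, 1, 1)
    std = torch.tensor(IMAGENET_STD).view(1, 3, 1, 1)
    return (x - mean) / std                              # (1, 3, 384, 384)

def impute_clinical(values: np.ndarray, defaults: np.ndarray) -> torch.Tensor:
    """Forward-fill each parameter over the 48 hourly steps; fall back to defaults.

    values: (K, T) array with NaN marking missing entries;
    defaults: (K,) pre-specified impute values (cf. Table 2)."""
    out = values.copy()
    K, T = out.shape
    for k in range(K):
        last = np.nan
        for t in range(T):
            if np.isnan(out[k, t]):
                out[k, t] = last if not np.isnan(last) else defaults[k]
            else:
                last = out[k, t]
    return torch.from_numpy(out).float()
```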
The multimodal medical transformer architecture. Building on the transformer architecture proposed by Vaswani et al. 13, which was subsequently extended for use in vision problems 12, we designed our medical transformer model to provide a direct way to incorporate imaging and non-imaging data into the learning process. Principally, as data inside transformer models is processed in tokens, there are no restrictions on its application to other modalities. More precisely, MeTra takes input data from two different modalities. Chest radiographs $x_{\mathrm{CXR}} \in \mathbb{R}^{H \times W}$ of image height $H$ and width $W$ are first processed by a vision backbone to extract high-level image features $z_{\mathrm{CXR}} \in \mathbb{R}^{N \times D}$ that can be fused with the data of other modalities later. Here, $N$ denotes the number of tokens and $D$ the dimensionality of the latent representation of each token. Any vision transformer model can be used for this task, thus allowing us to leverage models pre-trained on different datasets. In particular, MeTra uses a Vision Transformer (ViT) 12 with a patch size of 16, pre-trained on ImageNet and stripped of its final classification head, as its backbone. Additionally, the clinical parameters retrieved from the EHRs, $x_{\mathrm{CP}} \in \mathbb{R}^{K \times T}$, are projected into the latent representation $z_{\mathrm{CP}} \in \mathbb{R}^{M \times D}$ using a linear layer to match the dimensionality $D$ of the image tokens. Here, $K$ denotes the number of EHR items and $T$ the number of recorded time steps per item. We set $T = 48$ in all experiments, representing the values of the respective item for each hour within the first 48 h of patient admission to the ICU. A missing value is imputed by setting it to the most recent measurement value if available or to a pre-specified value (Table 2), as suggested by Harutyunyan et al. 37. To fuse imaging and non-imaging data efficiently, the latent representations of both backbones are concatenated to form the joint representation $z_{\mathrm{MULTI}} \in \mathbb{R}^{(N+M) \times D}$. The self-attention mechanism used inside transformers to process the input sequence does not consider the order of the elements in the sequence. To address this, we define a set of $N + M$ learnable position tokens of dimension $D$ that are added element-wise to $z_{\mathrm{MULTI}}$. Subsequently, a learnable class token CLS is prepended to $z_{\mathrm{MULTI}}$, and the resulting multimodal representation is processed by a transformer encoder, where the multi-head self-attention layers 13 allow cross-modality information transfer. A multi-layer perceptron with a sigmoid activation function is applied to the output to form the final prediction $p_{\mathrm{SURVIVAL}}$, which quantifies the likelihood of in-hospital survival of the patient. The MeTra architecture is visualized in Fig. 5.
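The architecture can be summarized in a few lines of PyTorch. The sketch below follows the description above (ViT-B/16 vision backbone, a linear projection of the clinical grid, learnable position and class tokens, a transformer encoder for fusion, and a sigmoid MLP head); it assumes the timm library for the pre-trained backbone, and all module names, the encoder depth, and the exact token bookkeeping are our assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn
import timm  # assumed source of the ImageNet-pre-trained ViT backbone

class MeTraSketch(nn.Module):
    """Illustrative MeTra-style fusion model (not the authors' released code)."""

    def __init__(self, num_params=15, num_steps=48, dim=768, depth=4, heads=8):
        super().__init__()
        # Vision backbone: ViT-B/16 at 384 px, classification head removed.
        self.vision = timm.create_model(
            "vit_base_patch16_384", pretrained=True, num_classes=0)
        # Linear projection: each clinical parameter's 48-h series becomes one token.
        self.clinical_proj = nn.Linear(num_steps, dim)
        # Learnable position tokens (added element-wise) and a class token.
        n_img = (384 // 16) ** 2 + 1  # 576 patch tokens + backbone CLS (timm layout)
        self.pos = nn.Parameter(torch.zeros(1, n_img + num_params, dim))
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        # MLP head with sigmoid activation for the survival probability.
        self.head = nn.Sequential(nn.Linear(dim, dim), nn.GELU(),
                                  nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, x_cxr, x_cp):
        # x_cxr: (B, 3, 384, 384) radiographs; x_cp: (B, K, T) clinical grid.
        z_img = self.vision.forward_features(x_cxr)      # (B, N, D) image tokens
        z_cp = self.clinical_proj(x_cp)                  # (B, M, D) clinical tokens
        z = torch.cat([z_img, z_cp], dim=1) + self.pos   # z_MULTI with positions
        cls = self.cls.expand(z.size(0), -1, -1)
        z = self.encoder(torch.cat([cls, z], dim=1))     # cross-modal self-attention
        return self.head(z[:, 0])                        # (B, 1) p_SURVIVAL from CLS
```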
We trained three variants to compare the influence of the different modalities on the models' final performance. The model using only the clinical parameters retrieved from the EHR ("clinical parameters-only model") was trained with the imaging input $x_{\mathrm{CXR}}$ set to zero. Similarly, for the corresponding model that only used the chest radiographs for predictions ("radiographs-only model"), the clinical parameters $x_{\mathrm{CP}}$ were set to zero. Finally, the combined model was trained by resuming the training routine from the checkpoint of the clinical parameters-only model with the highest area under the receiver operating characteristic curve (AUROC) value on the validation set (which is distinct from the test set). Motivated by preliminary findings [not shown] that indicated a severe imbalance in the model's focus and substantial disregard of the non-imaging data when trained on imaging and non-imaging data at once, we modified the training strategy of the combined model as follows: The imaging information was excluded during initial training and only provided (alongside the non-imaging information) during the subsequent training steps. Consequently, the combined model uses a setting similar to the unimodal models, i.e., starting from the same initial random states, but applying a full dropout of the imaging information during the initial epochs of training. No further restrictions on the available data were made; therefore, all information present in $x_{\mathrm{CXR}}$ and $x_{\mathrm{CP}}$ was used. To further prevent the multimodal transformer encoder from relying exclusively on information originating from the vision backbone, all pixels in $x_{\mathrm{CXR}}$ were randomly set to zero with probability $p_{\mathrm{VDO}}$ (chosen to be 30% based on preliminary studies). We coined this procedure vision dropout. Training was performed on an NVIDIA Quadro RTX 6000 for 200 epochs to guarantee the convergence of each model. As the learning objective, we minimized the binary cross-entropy loss

$\mathcal{L}_{\mathrm{BCE}} = -\left[\, y \log\!\left(1 - p_{\mathrm{SURVIVAL}}\right) + (1 - y) \log p_{\mathrm{SURVIVAL}} \,\right],$

where $y \in \{0, 1\}$ represents the ground truth value for survival: 1 denotes that the patient died during the hospital stay, and 0 denotes that the patient was discharged alive. We used the AdamW 38 optimizer with a learning rate of 5e-6, which was decreased over time using the cosine annealing procedure 39 until a final learning rate of 1e-7 was reached. The entire code was written in Python v3.8, and MeTra was implemented using PyTorch v1.11.0. For more information regarding our training procedure, please refer to Supplementary Table S1.
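A hedged sketch of one training step of the combined model is given below. It reads the description above as dropping the entire radiograph with probability $p_{\mathrm{VDO}} = 0.30$ and uses the stated AdamW and cosine-annealing settings; the loop skeleton, the function names, and the per-epoch scheduler stepping are our assumptions.

```python
import torch

P_VDO = 0.30  # vision dropout probability stated in the text

model = MeTraSketch()  # hypothetical module from the sketch above
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-6)
# Cosine annealing from 5e-6 down to 1e-7; step once per epoch over 200 epochs.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=200, eta_min=1e-7)
loss_fn = torch.nn.BCELoss()  # model output already passed through a sigmoid

def training_step(x_cxr, x_cp, y):
    """One optimization step; y = 1 encodes in-hospital death."""
    if torch.rand(1).item() < P_VDO:       # vision dropout: blank the radiograph
        x_cxr = torch.zeros_like(x_cxr)
    p_death = 1.0 - model(x_cxr, x_cp).squeeze(1)  # survival prob. -> death prob.
    loss = loss_fn(p_death, y.float())             # matches the BCE given above
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# After iterating over all batches of an epoch, call scheduler.step().
```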
Description of experiments. In the first experiment, the model was trained only on the clinical parameters and subsequently evaluated with these data as exclusive input. In the second experiment, the model was trained only on the imaging data and evaluated with these data only.
In the third experiment, the model was trained on all data and evaluated using all data. To study how missing data impact its performance, the combined model (third experiment) was additionally provided with the full imaging data but only parts of the clinical parameters as input. In detail, this experiment was repeated 100 times for each of 2, 4, 6, 8, 10, 12, and 14 clinical parameters set to "missing". The missing parameters were chosen randomly within each of the 100 runs to prevent bias in the choice of variables.
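The missing-parameter analysis can be expressed compactly. The sketch below assumes the clinical input as a tensor of shape (B, K, T) with K = 15 parameters, blanks a random subset per run, and aggregates a caller-supplied AUROC evaluation; `evaluate_auroc` and all other names are placeholders for the actual test routine.

```python
import numpy as np
import torch

K = 15  # clinical parameters fed to the model

def robustness_experiment(x_cp_test, evaluate_auroc, n_runs=100, seed=0):
    """AUROC spread when 2..14 clinical parameters are blanked at random."""
    rng = np.random.default_rng(seed)
    results = {}
    for n_missing in (2, 4, 6, 8, 10, 12, 14):
        aurocs = []
        for _ in range(n_runs):
            masked = x_cp_test.clone()
            drop = rng.choice(K, size=n_missing, replace=False)
            masked[:, list(drop), :] = 0.0   # zero out the chosen parameters
            aurocs.append(evaluate_auroc(masked))
        results[n_missing] = (float(np.mean(aurocs)), float(np.std(aurocs)))
    return results
```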
We evaluated the AUROC, the area under the precision-recall curve (AUPRC), sensitivity, specificity, and positive predictive value for all experiments.
Statistical analysis. Statistical analyses were conducted using Python v3.8 with its libraries NumPy and SciPy. Bootstrapping with 10,000 redraws was employed for each measure to determine the statistical spread and calculate p-values for differences 40. For calculating sensitivity and specificity, a threshold was chosen according to Youden's criterion 41, i.e., a threshold that maximized (sensitivity + specificity). We included all patients for whom both radiographs and clinical parameters were available.
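For illustration, the two statistical building blocks, the Youden-optimal operating point and the bootstrapped confidence interval, might look as follows; we use scikit-learn's ROC utilities here for brevity, whereas the paper names NumPy and SciPy, so treat this as an equivalent sketch rather than the authors' code.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

def youden_threshold(y_true, y_score):
    """Threshold maximizing sensitivity + specificity (equivalently TPR - FPR)."""
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    return thresholds[np.argmax(tpr - fpr)]

def bootstrap_auroc(y_true, y_score, n_boot=10_000, seed=0):
    """Percentile 95% CI of the AUROC over bootstrap redraws of the test set."""
    rng = np.random.default_rng(seed)
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))  # resample with replacement
        if y_true[idx].min() == y_true[idx].max():
            continue  # AUROC undefined if a redraw contains a single class
        stats.append(roc_auc_score(y_true[idx], y_score[idx]))
    lo, hi = np.percentile(stats, [2.5, 97.5])
    return float(np.mean(stats)), (float(lo), float(hi))
```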

Data availability
All data, including imaging and non-imaging data, is publicly available from the MIMIC database 21,34 on PhysioNet (for MIMIC-IV, see https://physionet.org/content/mimiciv/1.0/, and for MIMIC-CXR-JPG, see https://physionet.org/content/mimic-cxr-jpg/2.0.0/). The code to extract the chest radiographs and corresponding clinical parameters can be found in the GitHub repository linked in the code availability section.

Code availability
The entire code is publicly available on GitHub via https://github.com/FirasGit/MeTra. We also provide detailed information on all data splits into training and testing so that other groups can compare their algorithms with ours.